Ideas and exercises come from https://r4ds.had.co.nz/transform.html
Additional notes by TCS
First, we load the tidyverse and some companion packages, plus a dataset. The
flights data frame contains all 336,776 flights that departed from New York City
in 2013.
The Lahman baseball dataset is also used.
require(nycflights13)
require(tidyverse)
require(hms)
require(Lahman)
flights
There are 5 key functions (“verbs”), plus a helper function, that do most data manipulation tasks in dplyr:
* filter(): pick observations by values
* arrange(): reorder rows
* select(): pick variables by name
* mutate(): create new variables from existing ones, using functions
* summarise(): collapse many values into single ones
* group_by(): change the scope of a verb from the whole dataset to individual groups

filter() gives you a subset of rows based on values. filter() includes ONLY rows where the condition is TRUE;
it excludes both FALSE and NA values. If you want to preserve missing
values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
(biggerthan1 <- filter(df, x > 1))
(bigorNA <- filter(df, is.na(x) | x > 1))
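A further hedged sketch of filter() on the flights table: multiple comma-separated conditions are combined with AND, and %in% handles set membership. (The particular subset chosen here is just for illustration.)

```r
library(dplyr)
library(nycflights13)

# Multiple comma-separated conditions are combined with AND
(nov_dec_jfk <- filter(flights, month %in% c(11, 12), origin == "JFK"))

# The same subset written with explicit logical operators
(nov_dec_jfk2 <- filter(flights, (month == 11 | month == 12) & origin == "JFK"))
```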
arrange() is for sorting rows. The arguments to arrange() are a data frame and column
name(s). The desc() helper reverses the order.

select() gives you a subset of columns by name. You can name each column, or specify a range using a colon.
As with other R selections, you can omit certain columns using the minus
sign.
You can add multiple arguments to include more columns in the
selection.

select() does not have to use exact column matches; helper functions include:

* starts_with("foo")
* ends_with("bar")
* contains("foobar")
* matches(some_regex)
* num_range("x", 1:3) matches x1, x2 and x3

rename() is considered a variant of select() where you
take a column, change its name, and keep all other columns as well. everything() is a helper for select() that lets you
move one or a few columns to the beginning (left) of the table, while
retaining all other columns.

mutate() is for creating new variables from old. transmute() is a variant where you keep ONLY the newly
defined columns.

(flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
))
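A hedged sketch of arrange(), the select() helpers, rename(), and everything() on the flights table (the particular columns chosen are just for illustration):

```r
library(dplyr)
library(nycflights13)

# arrange() sorts rows; desc() reverses the order (NAs always sort last)
(most_delayed <- arrange(flights, desc(dep_delay)))

# select() helpers match columns by name pattern
(delay_cols <- select(flights, ends_with("delay")))

# rename() keeps all columns, changing one name
(renamed <- rename(flights, tail_num = tailnum))

# everything() moves the chosen columns to the front, keeping the rest
(reordered <- select(flights, time_hour, air_time, everything()))
```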
(flights_less_sml <- mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
))
(flights_even_more <- mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
))
(flights_new <- transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
))
There are many useful creation functions for mutate(). Some examples:
* Arithmetic, logs (vectors recycle to match length)
* Modulus arithmetic: %/% (integer division) and %% (remainder), where x
== y * (x %/% y) + (x %% y)
* Surrounding values: lead(), lag()
* Cumulative values such as cumsum() and cumprod()
* With the RcppRoll package, rolling sums, etc.
* Ranking such as min_rank()
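The functions above can be sketched on a small toy vector (the column names here are made up for illustration):

```r
library(dplyr)

(d <- tibble(x = c(3, 1, 4, 1, 5)) %>%
  mutate(
    halves  = x %/% 2,   # integer division
    rem     = x %% 2,    # remainder: x == 2 * halves + rem
    prev    = lag(x),    # preceding value (NA for the first row)
    running = cumsum(x), # cumulative sum
    rnk     = min_rank(x) # ties share the minimum rank
  ))
```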
mutate() exercises: Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
*The times are in HHMM format, so straight subtraction does not work. We can convert to time objects, or use modulus arithmetic to convert to minutes and back to time.
We expect dep_delay = dep_time - sched_dep_time. However, when the delay crosses an hour boundary (or midnight), we have the same HHMM-format issue, and the same solution, as in exercise 2.
rank() goes smallest to largest, gives decimals
(averages) for ties (1, 2.5, 2.5, 4, …), and puts NAs last by default, but you
can change that behavior using na.last.

rank(desc(x)) goes largest to smallest. rank() has different ties.method options for handling ties,
all producing integer output: "first", "last", "random", "max" (the maximum rank of all
ties, so there might not be a #1), and "min" (the minimum rank of all tied elements “as
used in sports,” so you can “tie for 1st”: 1, 1, 3, 4, …). dplyr’s min_rank() corresponds to
ties.method = "min"; "first", "last", and "random"
produce all-unique ranks. As an alternative to the rank() functions, you could use
arrange() and then row numbers.

In 1:10 + 1:3, the shorter 1:3 vector is recycled, so you get 1+1, 2+2, 3+3, then 4+1, 5+2, etc. (with a warning, because 10 is not a multiple of 3).
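The tie-handling methods can be compared on a small vector:

```r
library(dplyr)

x <- c(10, 20, 20, 30)

(avg_ranks   <- rank(x))                          # averages for ties: 1, 2.5, 2.5, 4
(min_ranks   <- rank(x, ties.method = "min"))     # "as in sports": 1, 2, 2, 4
(max_ranks   <- rank(x, ties.method = "max"))     # 1, 3, 3, 4 -- no rank 2
(first_ranks <- rank(x, ties.method = "first"))   # all-unique: 1, 2, 3, 4
(dplyr_ranks <- min_rank(x))                      # same as ties.method = "min"
(rev_ranks   <- min_rank(desc(x)))                # largest to smallest: 4, 2, 2, 1
```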
The usual trigonometric functions are available, plus a few others. Angles are in radians.
cospi(x) and its kin (sinpi(), tanpi()) give the functions of pi times x, and are exact when
x is a multiple of 0.5.
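A quick check of that exactness, using only base R:

```r
# cos(pi * x) computed directly picks up floating-point error,
# while cospi() is exact at multiples of 0.5
(direct    <- cos(pi * 0.5))  # tiny non-zero value, not exactly 0
(exact     <- cospi(0.5))     # exactly 0
(exact_sin <- sinpi(1))       # exactly 0
```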
# The time is in HMM or HHMM format. Therefore to get the hours we divide by 100, and for minutes we take the remainder. (Did not deal with the times = 2400!)
(better_time <- mutate(flights,
  dep_min_since_midnight = (dep_time %/% 100) * 60 + dep_time %% 100,
  sched_dep_min_since_midnight = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100
))
(compare <- select(flights, air_time, arr_time, dep_time) %>%
  mutate(spent_time = arr_time - dep_time))
# the times are in HHMM format, so straight subtraction does not work
(realtimes <- mutate(flights,
arr_time_hms = hms(NULL, arr_time %% 100, arr_time %/% 100),
dep_time_hms = hms(NULL, dep_time %% 100, dep_time %/% 100),
spent_time_hms = difftime(arr_time_hms, dep_time_hms, units = "mins")))
# mutate to add the ranking function (alternatively you could probably arrange and then take by row number)
# if you preserve or average ties, you could end up with more or fewer than 10 in the result. Therefore to get exactly 10 we will break ties randomly.
(arr_ranked <- mutate(flights, arr_delay_rank = rank(desc(arr_delay), ties.method = "random")))
(top_10_delay <- arr_ranked %>% filter(arr_delay_rank <= 10))
summarise() collapses groups into single values.

* When using summarise() you usually also want groups created by group_by(). (If you didn’t have a group, you could just call the desired functions on whole columns.)
* You can group by multiple levels.
* If you need to remove a grouping, use ungroup().
* Using the pipe we can quickly feed groups into summaries and do other useful tasks efficiently.
* Among tidyverse functions, only ggplot2 doesn’t work as well with pipes: “it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.” But it is used in some examples below.
# with pipe
(delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL"))
* Beware: if you have any missing values (NAs) in your data, your summary value will also be NA, unless you set na.rm = TRUE.
* mean(arr_delay[arr_delay > 0]) is the mean of only those values > 0.

# plot delay vs distance
(p <- ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
)
# remove NAs when computing the summary
(delayed <- flights %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE)))
`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
# remove NAs by removing data at the beginning
(not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
)
(true_delayed <- not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE)))
# which planes have the greatest delays?
(delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay)
))
# histogram of average delay times
(p1 <- ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10))
# remove NAs and cancelled
(delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
))
# scatterplot of average delay vs n (rotated histogram actually, but you can see more points)
# shows that the high averages are almost all based on few data points
(p3 <- ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10))
# filter out the points with low n before plotting
(p4 <- delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10))
# instead of averaging early and late flights together, start with the logical subset of the positive (late) delays
(not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
))
# looking at standard deviation of distance to each destination
(spread_of_origins <- not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd)))
# When do the first and last flights leave each day?
(first_and_last <- not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
))
(first_last_2 <- not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time), # This works ONLY because the table is already ordered by dep_time!
last_dep = last(dep_time)
))
# another way to find the 1st and last per day
(rankfilt <- not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))) # range gives you the min and max values
You can group hierarchically and see summaries within summaries. You can ungroup if needed.
# group multiple levels
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
# ungroup
(daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n())) # all flights
Here we continue digressing (IMO) on the relationship between variation and sample size, using examples from the Lahman baseball statistics package.
# Convert to a tibble so it prints nicely
(batting <- as_tibble(Lahman::Batting))
# batting average vs at bats per player
# visualize avg with the number of at-bats
(batters <- batting %>%
group_by(playerID) %>%
summarise(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE)
))
(p <- batters %>%
filter(ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point() +
geom_smooth(se = FALSE))
# who are the "best" batters? The lucky ones? Note the at-bats for the top of the list
(high_avg <- batters %>%
arrange(desc(ba))
)
Now we get back to the flight data and try methods of counting.
# Which destinations have the most carriers?
(not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>% # unique values
arrange(desc(carriers)))
# how many values are entered?
(valid_arr_time <- sum(!is.na(not_cancelled$arr_time)))
[1] 327346
# or to get the fraction of valid values
(valid_arr_time_frac <- mean(!is.na(not_cancelled$arr_time)))
[1] 1
(valid_arr_time_frac2 <- mean(!is.na(flights$arr_time)))
[1] 0.9741282
# count() is a dplyr summary of a grouped table
(flights_per_dest <- not_cancelled %>%
count(dest))
# you can weight counts by another variable
(aircraft_miles <- not_cancelled %>%
count(tailnum, wt = distance))
To clarify these differences, use the % of flights that are delayed; the mean and standard deviation of delays, including positive and negative; the mean absolute value of delays > 0; and the median delay.
Which is more important: arrival delay or departure delay?
To the passenger, arrival delay is most important. To the airport, departure delay is also important because other flights are affected.
* To reproduce count() without using count(): use n() inside summarise().
* Our definition of cancelled flights, (is.na(dep_delay) | is.na(arr_delay)), is slightly suboptimal. Why? Which is the most important column?
* To compare carriers: group by carrier AND airport, look at % on-time flights, mean and median delay (> 0 and total). Compare a carrier in an airport to the total for that airport, or (in case one carrier predominates; check the % of flights from that carrier) to the total for the other carriers at the same airport.
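One possible sketch of the carrier-within-airport comparison: mean arrival delay per carrier per destination, next to the overall mean for that destination. The column names carrier_delay and dest_delay are made up for illustration.

```r
library(dplyr)
library(nycflights13)

(carrier_vs_dest <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest, carrier) %>%
  summarise(carrier_delay = mean(arr_delay), n = n(), .groups = "drop_last") %>%
  # still grouped by dest here, so this is the destination-wide mean
  mutate(dest_delay = weighted.mean(carrier_delay, n)) %>%
  ungroup())
```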
The sort argument to count() displays the largest groups at the top of the output. You could use it as a quick summary instead of two separate commands to find, e.g., the largest or most delayed groups.
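For example, count(sort = TRUE) replaces a separate summarise() + arrange(desc(n)) pair:

```r
library(dplyr)
library(nycflights13)

# largest destination groups first
(busiest_dests <- flights %>% count(dest, sort = TRUE))
```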
# ---- assess lateness in different ways
# distinguish early + late vs constantly late vs. nearly always on time;
# mean and standard deviation of delays; mean absolute value of delays; median delay
# count the flights per destination without count()
(flights %>% group_by(dest) %>% summarise(n()))
# get aircraft miles: group by the tail number and sum the miles
(flights %>% group_by(tailnum) %>% summarise(miles = sum(distance)))
# Find the worst members of each group:
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
# Find all groups bigger than a threshold:
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
# Standardise to compute per group metrics:
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
Which plane (tailnum) has the worst on-time record?
Check the % on-time; or check the % on-time within some interval.
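A hedged sketch of the first approach; defining “on time” as arr_delay <= 0 and requiring at least 20 flights are both assumptions made for illustration:

```r
library(dplyr)
library(nycflights13)

(worst_on_time <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  summarise(prop_on_time = mean(arr_delay <= 0), n = n()) %>%
  filter(n >= 20) %>%          # assumed cutoff so tiny samples don't dominate
  arrange(prop_on_time))
```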
divide day into bins and summarise, or plot time vs % on-time
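The binning approach might look like this; using the scheduled hour as the bin and dep_delay > 0 as “delayed” are assumptions:

```r
library(dplyr)
library(nycflights13)

# % of flights delayed per scheduled-departure hour
(delay_by_hour <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(dep_hour = sched_dep_time %/% 100) %>%
  summarise(prop_delayed = mean(dep_delay > 0), n = n()))
```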
group by dest, sum delays (positive). 2nd question is not very clear – but i think they want (delay for a single flight)/(total delays for the destination). or (delay for all flights with that number)/(total delays for the destination).
Sort by time, then graph delay vs lag(delay).
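A sketch of the lag step, pairing each flight with the preceding departure from the same origin (grouping by origin is an assumption about what “preceding flight” means):

```r
library(dplyr)
library(nycflights13)

(lagged <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(origin) %>%
  arrange(year, month, day, dep_time, .by_group = TRUE) %>%
  mutate(prev_delay = lag(dep_delay)) %>%  # NA for the first flight per origin
  ungroup())
```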
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
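One way to sketch the relative-air-time computation; rel_air_time is a made-up column name, and values near 1 would be the ones worth inspecting:

```r
library(dplyr)
library(nycflights13)

(relative_speed <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  mutate(rel_air_time = air_time / min(air_time)) %>%  # 1 = shortest to that dest
  ungroup() %>%
  arrange(rel_air_time))
```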
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
Most on-time; lowest average delay magnitude.
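A sketch of the first half; ranking carriers by how many multi-carrier destinations each serves is one possible (assumed) ranking:

```r
library(dplyr)
library(nycflights13)

(carrier_rank <- flights %>%
  group_by(dest) %>%
  filter(n_distinct(carrier) >= 2) %>%  # keep multi-carrier destinations
  group_by(carrier) %>%
  summarise(n_dests = n_distinct(dest)) %>%
  arrange(desc(n_dests)))
```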
Group by tail number, sort by date (and time), find the earliest delay of more than 1 hour, and count that plane's flights with date/time earlier than that value.
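That count can be sketched with dplyr's cumall(), which stays TRUE until its condition first fails; the 60-minute threshold and the ordering columns are assumptions:

```r
library(dplyr)
library(nycflights13)

(before_big_delay <- flights %>%
  filter(!is.na(dep_delay), !is.na(tailnum)) %>%
  arrange(tailnum, year, month, day, dep_time) %>%
  group_by(tailnum) %>%
  mutate(no_big_delay_yet = cumall(dep_delay <= 60)) %>%
  summarise(n_before = sum(no_big_delay_yet)))
```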